Translating MATLAB software to R

my experience with rBAPS

Waldir Leoncio

Oslo Centre for Biostatistics and Epidemiology, UiO

2024-05-30

ExcusesDisclaimers

  1. This project started in 2019, a.k.a. the year 3 BC copilot
  2. Total project time, so far: 450 hours
  3. I was raised by (programming) wolves
    • BS + PhD in Statistics
    • Statistician + affinity for computers = sloppy programmer 😳
    • Knowledge of science + knowledge of software development = great RSE! 😎

The job

Translate core functionality of BAPS from MATLAB to R

  • Initially, 5 273 of 42 461 LOC (i.e., 4 modules)
  • Replace/incorporate some related features

BAPS

  • Bayesian Analysis of (genetic) Population Structure
  • About a dozen methodological papers for BAPS (early 2000s)

OG team

The foreseen challenges

Language issues

  1. I was raised as a 100% R guy (i.e., I’d only heard of MATLAB)
  1. Some of the codebase looks like this:
global POP_LOGML

if ready == 0
  if vaihe == 1
    roundTypes = [1];
  elseif vaihe == 2
    roundTypes = [2 1];
  elseif vaihe == 3
    roundTypes = [5 5 7];
  elseif vaihe == 4
    roundTypes = [4 3 1];
  elseif vaihe == 5
    roundTypes = [6 7 2 3 4 1];
  end
end

% TALLENNETAAN
npops = poistaTyhjatPopulaatiot(npops);

The legacy

  • ☹️ Original dev team (5 PhD students) had long disbanded
  • 😀 Methodological support from Jukka and XP from Gerry were priceless
  • 😭 BAPS 6 (latest version) has no CLI, so debugging = lots of clicking around

The plan

  1. Run the original BAPS to get acquainted
  2. Translate software to R, acquire XP on the go
  3. Publish when it’s ready (“5k LOC? Should take a couple of years, LOL”)

The unforeseen challenges

(buckle up, kiddos)

Getting the original BAPS to run

  • Can’t run the old binaries
  • Can’t recompile the source code (skill issue)
    • Eventually got the original working in Jan 2024
    • Before then, much manual running of single MATLAB functions (bottom-up)
  • Even after running original, demo data wouldn’t run (version issue somewhere, probably)

Top-down vs. bottom-up programming

Go from top-level functions down the rabbit hole or the other way around?

Aspect ⬇ Top-down ⬆ Bottom-up
Chance for errors Higher Lower
Chance to waste coding Lower Higher
Initial speed Fast Slow
  1. Started bottom-up
  2. Got annoyed at 🐢 development speed, switched
  3. Things broke spectacularly, switched back.

Result: 4 years of “one step forward, one step back”

Coming to grips with an alien language and its IDE

<rant><matlab>

“WTF are these shortcuts? Alt+W to copy? Ctrl+Y pastes?”

(FWIW this is all fixed in MATLAB Online)

MATLAB is just weird (for an R user)

  • No default values for function arguments?
  • “WTF is this nargin()?”
  • Inconsistent behavior w.r.t. “holy R”

MATLAB subsets are just weird

if testaaPop(line3)==1
  %2 rivi tällöin lokusrivi
  nloci = rivinSisaltamienMjonojenLkm(line2);
  line4 = fgetl(fid);
  if isequal(line4,-1)
      disp('Incorrect file format'); fclose(fid);
      return
  end
  if ~any(line4==',')
      % Rivin nelj?täytyy sisältää pilkku.
      disp('Incorrect file format'); fclose(fid);
      return
  end
  pointer = 1;
  while ~isequal(line4(pointer),',')  %Tiedetään, ett?pysähtyy
      pointer = pointer+1;
  end
  line4 = line4(pointer+1:end);  %pilkun jälkeinen osa
  nloci2 = rivinSisaltamienMjonojenLkm(line4);
  if (nloci2~=nloci)
      disp('Incorrect file format'); fclose(fid);
      return
  end
end

Are those subsets or function calls?

MATLAB assignments are also weird

>> 10:-1:1

ans =

    10     9     8     7     6     5     4     3     2     1

MATLAB assignments are also weird

>> 10:-1:1

ans =

    10     9     8     7     6     5     4     3     2     1

>> a = max(10:-1:1)

a =

    10

>> [a, b] = max(10:-1:1)

a =

    10


b =

     1

Homonymous functions can work differently

>> A = [1 2 3; 4 5 6]

A =

     1     2     3
     4     5     6

R> (A <- matrix(c(1, 4, 2, 5, 3, 6), 2))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
>> sum(A)

ans =

     5     7     9
R> sum(A)
[1] 21

This alone was the cause of multiple elusive bugs

</matlab><r>

R vectors are just weird

(x <- c(11, 22, 33))
[1] 11 22 33
(tx <- t(c(11, 22, 33)))
     [,1] [,2] [,3]
[1,]   11   22   33

“So x is not even a row vector?”

R vectors are just weird

(x <- c(11, 22, 33))
[1] 11 22 33
(tx <- t(c(11, 22, 33)))
     [,1] [,2] [,3]
[1,]   11   22   33
dim(x)
NULL
dim(tx)
[1] 1 3

R vectors are just weird

(x <- c(11, 22, 33))
[1] 11 22 33
(tx <- t(c(11, 22, 33)))
     [,1] [,2] [,3]
[1,]   11   22   33
dim(x)
NULL
dim(tx)
[1] 1 3
class(x)
[1] "numeric"
class(tx)
[1] "matrix" "array" 

R vectors are just weird

(x <- c(11, 22, 33))
[1] 11 22 33
(tx <- t(c(11, 22, 33)))
     [,1] [,2] [,3]
[1,]   11   22   33
dim(x)
NULL
dim(tx)
[1] 1 3
class(x)
[1] "numeric"
class(tx)
[1] "matrix" "array" 

Or maybe it’s just the way we rely on c() instead of matrix() for creating vectors?

Nope. “Core R” does it too:

(A <- matrix(11:19, nrow = 3))
     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19
class(A[2, ])
[1] "integer"

No, not "double" as before. Don’t get me started on R’s dynamic typing shenanigans.

Nope. “Core R” does it too:

(A <- matrix(11:19, nrow = 3))
     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19
class(A[2, ])
[1] "integer"
class(A[2, , drop = FALSE])
[1] "matrix" "array" 

</r></rant>

Porting packages is harder than scripts (well, duh)

  • No idea about how scopes on MATLAB packages work
  • But I could see lots of duplicated/overloaded functions
  • So 5 kLOC became much more

The byproducts

After 450 hours of work:

  1. A still broken R package on GitHub: rBAPS
  2. Preservation of the original code on GitHub: BAPS
  3. Restoration of original webpage’s content on the GH Wiki
  4. A translation layer published on CRAN: matlab2r

The roadmap

  • Line-by-line, parallel debugging
  • Adaptation of original code to run on terminal

i.e., back to basics, taking it slow (which finally seems to be working)

Three Lessons learned

After 4 years working as an RSE

1. Publishing papers vs. software

Understand and disseminate the differences.

Product Chances to publish Patches
Research paper “one” “embarrassing”
Research software “infinite” good
  • Make your MVP as M as possible (4 modules is 3 too many; no extras)
  • Don’t give estimates (or give it in hundreds of hours, if you must)
  • Release ASAP to avoid overhype
  • …but also take it slow and steady
  • No such thing as bug-free software
  • “Art can’t be finished, only abandoned” (Leonardo da Vinci?)

2. Don’t hate the language, hate the IDE

  • Use whatever tool works best for you (even Emacs)
  • But don’t be afraid to explore new things!
    • 2005-2019: “R and RStudio are all I care about”
    • 2020-2024: “I love all languages (even MATLAB). VSC and vim are all I need!”
    • 20XX: ?

3. Future-proof your ideas: make them public

  • It’s a scientific community, after all
  • Relax, nobody’s gonna steal your idea
  • And if they do, that’s what OS licenses are for
  • And yes, they are legally valid (IANAL, so check with your local L)

Publishing your code actually protects it!

Commit and push timestamps your work

  • Write object names Samuel could easily pronounce
  • Never feel bad about your contribution. Everybody is embarrased of code they wrote years months ago!

Thank you for your attention!

Thank you for hosting us! ❤️ 🇫🇮

Github: wleoncio, ocbe-uio